Abstract

We consider priors for several nonparametric Bayesian models which
use finite random series with a random number of terms. The prior is
constructed through distributions on the number of basis functions and
the associated coefficients. We derive a general result on the construction
of an appropriate sieve and obtain adaptive posterior contraction
rates for all smoothness levels of the function in the true model. We
apply this general result to several statistical problems such as signal
processing, density estimation, nonparametric additive regression,
classification, spectral density estimation and functional regression. The
prior can be viewed as an alternative to the commonly used Gaussian
process prior, but can be analyzed by relatively simpler techniques and in
many cases allows a simpler approach to computation without using
Markov chain Monte Carlo (MCMC) methods. A simulation study was
conducted to show that the performance of the random series prior is
comparable to that of a Gaussian process prior.

1 Introduction 

Bayesian methods have been widely used in the nonparametric statistical
literature. In recent years, there has been a lot of development on the
asymptotic properties of posterior distributions. General results have been
established by Ghosal et al. [1999] for posterior consistency and by Ghosal
et al. [2000] and Shen and Wasserman [2001] for posterior convergence rates.
It is well known that the optimal convergence rate for the estimation of
functions is determined by the smoothness level of the functions, e.g., the
optimal rate of estimating a univariate $\alpha$-smooth function is
$n^{-\alpha/(2\alpha+1)}$ [Stone, 1982], where $n$ is the sample size. Since
the smoothness parameter $\alpha$ is usually unknown in practice, a natural
Bayesian procedure will consider assigning a prior distribution on $\alpha$.
It is then of interest to investigate whether the resulting mixture prior
leads to posterior distributions that have optimal posterior convergence
rates simultaneously for all values of $\alpha$. Such procedures are called
rate-adaptive.



Progress has been made in studying rate-adaptive procedures. Belitser and
Ghosal [2003] considered the problem of estimating a signal with Gaussian
white noise and showed that the posterior rate automatically adapts to the
unknown smoothness condition if the smoothness parameter only takes values
in a discrete set. Ghosal et al. [2003, 2008] showed that appropriate
mixtures of certain priors, such as those based on spline expansions, yield
optimal posterior rates for a countable range of smoothness parameters for
density estimation. Similar work has been done in a nonparametric regression
setting using wavelets in Huang [2004]. Scricciolo [2006] obtained adaptive
rates for density estimation problems when the smoothness parameter belongs
to a discrete set. The basic idea behind this approach is to use the optimal
dimension $J_{n,\alpha}$ of the model for a given smoothness level $\alpha$
and sample size $n$, obtained from some appropriate rate equation, to
construct "optimal priors" $\Pi_\alpha$ for each $\alpha$, and then mix
countably many of them to construct the mixture prior, which adapts for all
these countably many smoothness levels. Alternatively, van der Vaart and van
Zanten [2009] constructed a prior based on a randomly rescaled smooth
Gaussian process, which automatically adapts for a continuous range of
smoothness parameters.

In the nonparametric Bayesian literature, a prior on a function is usually
constructed via a stochastic process (e.g., a Gaussian process prior in
van der Vaart and van Zanten [2008, 2009]), or by expanding a function in a
basis of functions, or by convolving a kernel with a random measure. In our
study, we focus on the basis expansion approach by putting a prior on the
coefficients of basis functions and the dimension of the basis function
spaces. There are several advantages of using a prior directly on the model
dimension $J$ rather than on the smoothness level. First, this bypasses the
need for specifying the optimal dimension $J_{n,\alpha}$ in practice, which
is given by posterior convergence theorems only up to a constant multiple.
In a sense, this is more natural from a Bayesian point of view and is a
common method used by practitioners. More importantly, assigning a prior
directly on $J$ allows us to obtain adaptation for all smoothness levels in
an interval rather than for only a countable number of smoothness levels.



A similar idea was used in Babenko and Belitser [2010] for the white noise
model, which is equivalent to an infinite dimensional normal model. They
introduced the idea of the oracle dimension for every parameter value,
defined by the minimizer of risk in a class of estimation problems induced
by the dimension. The oracle dimension is used to slice the parameter space
into different smoothness levels. They assigned a prior distribution on the
oracle dimension and the projection of the infinite dimensional mean vector
on the finite dimensional oracle. They showed that the risk of the Bayes
estimator satisfies some desirable oracle inequalities, which lead to a
complete adaptation of a Bayes estimator. Interestingly, the oracle
inequality gives adaptation simultaneously for many different smoothness
classes. While finishing writing this article, we also became aware of two
very recent works which use the same idea of directly assigning a prior
distribution on the number of terms in a basis expansion. Rivoirard and
Rousseau [2012] exclusively considered the density estimation problem using
a wavelet basis. de Jonge and van Zanten [2012] considered a general class
of inference problems using a spline basis and Gaussian priors on
coefficients, and hence the resulting priors are mixtures of finite
dimensional Gaussian processes. We formulate one general theorem in an
abstract setting, suitable as a prelude for many different inference
problems, where we allow arbitrary basis functions and arbitrary
multivariate distributions on the coefficients of the expansion. Thus the
resulting process induced on the function need not be Gaussian, and can
accommodate distributions ranging from bounded support to heavy tails. The
resulting rate obtained in the abstract theorem depends on the smoothness
of the underlying function, the approximation ability of the basis expansion
used, the tail of the prior distribution on the coefficients, the prior on
the number of terms in the series expansion, the prior concentration and the
metrics being used. The general theorem then gives rise to adaptive
posterior convergence rates for different estimation problems.

We illustrate the implication of the abstract theorem for the white noise
model, density estimation on a compact interval or the real line,
nonparametric normal, Poisson and binary regression, spectral density
estimation of a stationary time series and the functional regression model,
using the B-spline basis as one main example. B-splines have been well
studied by mathematicians [de Boor, 2001] and have been used in statistics
as well. Non-negativity, near orthogonality, summing to one and arbitrarily
high smoothness level of B-spline functions are collectively the reasons for
the popularity of the B-spline basis. The idea is to approximate a function
of interest as a linear combination of the spline functions. Then the
estimation of the function becomes equivalent to the estimation of the
coefficients in the B-spline basis expansion [Truong et al., 2005]. In some
cases, we do not approximate the true function directly. Instead, we
consider a transformation that is needed to satisfy certain constraints. For
example, for density estimation, an exponential transformation along with a
normalization step is often used, as the functions are required to be
nonnegative and integrate to one [Stone, 1990, Ghosal et al., 2000]. We also
discuss adaptation on Sobolev and Besov spaces besides the commonly used
Hölder spaces to quantify smoothness. For nonparametric normal regression,
we also remove a commonly used condition that the variance is bounded away
from zero. Further, for the density estimation problem, we allow a flexible
general link function instead of only the exponential link. By restricting
coefficients to $(0, \infty)$, we do not even need to use a link function.
The additional flexibility may be valuable in finding computational
algorithms.

The random series expansion prior may be regarded as a valuable alternative
to the Gaussian process prior. The Gaussian process has received a lot of
attention in the literature on all aspects, from modeling a prior
distribution [Leonard, 1978, Lenk, 1988] to computation [Tokdar, 2007] to
asymptotics [Tokdar and Ghosh, 2007, Ghosal and Roy, 2006, Choi and
Schervish, 2007, van der Vaart and van Zanten, 2007, 2008, 2009, Castillo,
2008, 2012] to applications in spatial statistics [Banerjee et al., 2008]
and elsewhere. Asymptotic properties of posterior distributions based on
Gaussian process priors are primarily driven by the structure of the
reproducing kernel Hilbert space of the process. While the elegant result of
van der Vaart and van Zanten [2009] established that appropriately randomly
rescaled Gaussian processes lead to posteriors that automatically adapt to
the unknown smoothness, in this paper we show that the same property also
holds for random series priors using relatively elementary techniques.
Computationally, Gaussian processes are relatively difficult to deal with.
Except for Gaussian Markov random fields, for which the integrated nested
Laplace approximation method has been developed [Rue et al., 2009], the
general approach to computation is to approximate the given Gaussian process
by one that is generated by finitely many normal variables, obtained by
conditioning the original process at a number of knots, which needs to be
sufficiently large [Tokdar, 2007]. The approach adaptively chooses the knots
through a reversible jump Markov chain Monte Carlo (RJMCMC). While RJMCMC
can also be used for the random series prior, we came up with a method that
poses a conjugate-like prior for the model and hence avoids the use of MCMC,
as the posterior can be represented analytically. When the sample size $n$
is relatively small (e.g., $n = 10$), the exact values can be obtained. When
the sample size is large, we use a direct sampling strategy. Thus, at least
conceptually, the random series prior gives rise to a more straightforward
approach to computation. It may be noted that Gaussian process and random
series priors are intimately related in two ways: normal priors on the
coefficients of a random series give Gaussian processes, and the
Karhunen-Loève expansion of a Gaussian process expresses it as a random
series with a basis consisting of eigenfunctions of the covariance kernel of
the Gaussian process. Thus random series may be regarded as a more general
and flexible alternative to Gaussian processes that allows a more
straightforward approach to computation and asymptotics.

There are two key steps in proving Bayesian adaptation results. The first
is to construct a sieve that contains the true underlying model while its
size is well controlled. The size is usually characterized by the existence
of tests or illustrated by entropy calculation [Ghosal et al., 2000]. The
other step is to construct an approximation of the true function whose
approximation accuracy increases appropriately with the increasing level of
smoothness. We derive a general result on the existence of an appropriate
sieve that can later be used in showing asymptotic results. In particular,
we illustrate its use on a few statistical models using B-spline functions
as the approximation technique.

The paper is organized as follows. We introduce some notation in Section 2.
In Section 3, we present the main theorems on random series and B-spline
functions. In Sections 4, 5, 6 and 7, we apply the theorems to a variety of
statistical problems and derive the corresponding posterior convergence
rates. In Section 8, we extend our discussion from the Hölder class to
Sobolev and Besov function spaces. Finally, a numerical study is presented
in Section 9.



2 Notations 

Denote $\mathbb{N} = \{1, 2, \ldots\}$, $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$ and
$\Delta_J = \{(x_1, \ldots, x_J) : \sum_{i=1}^J x_i = 1,\ x_1, \ldots, x_J \ge 0\}$.
Let $\|x\|_p = \{\sum_{i=1}^d |x_i|^p\}^{1/p}$ stand for the $\ell_p$-norm of a
vector $x \in \mathbb{R}^d$, $1 \le p < \infty$, and $\|x\|_\infty = \max_{1 \le j \le d} |x_j|$.
Similarly, we define $\|f\|_p = \{\int |f(x)|^p dx\}^{1/p}$ and
$\|f\|_\infty = \sup_x |f(x)|$ as the $L_p$-norm of a function $f$ for
$1 \le p \le \infty$. Define function spaces $L_p = \{f : \|f\|_p < \infty\}$. For a
probability measure $G$, define $\|f\|_{p,G} = \{\int |f(x)|^p dG(x)\}^{1/p}$.

We define $\delta_x$ as the point mass probability distribution at point $x$.
Define an indicator function of a set $A$ as $\mathbb{1}\{A\}$.

We define the $\alpha$-Hölder class $C^\alpha$ as the collection of functions $f$
that have bounded derivatives up to the order $\alpha_0$, which is the largest
integer strictly smaller than $\alpha$, and whose $\alpha_0$-th derivative
satisfies the Hölder condition

$|f^{(\alpha_0)}(x) - f^{(\alpha_0)}(y)| \le C |x - y|^{\alpha - \alpha_0}$  (2.1)

for a constant $C > 0$ and any $x, y$ in the support of $f$.

For two $k$-dimensional vectors $s$ and $t$, define $|s| = \sum_{i=1}^k s_i$,
$s! = \prod_{i=1}^k s_i!$ and $s^t = \prod_{i=1}^k s_i^{t_i}$.

We use $\lesssim$ for inequality up to a constant multiple, where the
underlying constant of proportionality is universal or not important for our
purposes. If two functions $f$ and $g$ satisfy $f \lesssim g \lesssim f$, we
shall write $f \asymp g$.

The Hellinger distance $h(p, q)$ and the Kullback-Leibler (KL) divergence
$K(p, q)$ between two densities $p$ and $q$ are commonly used in statistics.
They are respectively defined by $h^2(p, q) = \int (\sqrt{p} - \sqrt{q})^2 d\mu$
and $K(p, q) = \int p \log(p/q)\, d\mu$. Also define the second order KL
divergence by $V(p, q) = \int p \log^2(p/q)\, d\mu$. We define a KL ball around
$p$ with radius $\epsilon$ as
$\mathcal{K}(p, \epsilon) = \{f : K(p, f) \le \epsilon^2,\ V(p, f) \le \epsilon^2\}$.

We use $D(\epsilon, T, d)$ to denote the packing number, which is defined as
the maximum cardinality of an $\epsilon$-dispersed subset of $T$ with respect
to the distance $d$.
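
The divergences defined above can be evaluated numerically for densities on
a grid. The following is a minimal sketch, assuming densities on [0, 1] with
Lebesgue measure as the dominating measure and a hand-written trapezoidal
rule; the function names and the two example densities are illustrative
choices and do not come from the paper.

```python
import numpy as np

def _integrate(y, x):
    """Trapezoidal rule written out by hand (no library integrator assumed)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def hellinger(p, q, x):
    """Hellinger distance h(p, q), where h^2 = int (sqrt(p) - sqrt(q))^2 dx."""
    return np.sqrt(_integrate((np.sqrt(p) - np.sqrt(q)) ** 2, x))

def kl(p, q, x):
    """Kullback-Leibler divergence K(p, q) = int p log(p/q) dx."""
    return _integrate(p * np.log(p / q), x)

def kl2(p, q, x):
    """Second order divergence V(p, q) = int p log^2(p/q) dx."""
    return _integrate(p * np.log(p / q) ** 2, x)

# Illustrative densities on [0, 1] (assumptions, not from the paper).
x = np.linspace(0.0, 1.0, 20001)
p = np.ones_like(x)      # uniform density
q = 0.5 + x              # positive density that integrates to one

print(hellinger(p, q, x), kl(p, q, x), kl2(p, q, x))
```

A density $f$ lies in the KL ball of radius $\epsilon$ around $p$ when both
`kl(p, f, x)` and `kl2(p, f, x)` are at most $\epsilon^2$.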

3 General results 
3.1 Main theorem 

We consider a random variable $J$ taking values in $\mathbb{N}$. For each
$J \in \mathbb{N}$, we consider a set of basis functions
$\xi = (\xi_1, \xi_2, \ldots, \xi_J)^T$ defined on a measurable space $\Omega$.
A prior is assigned on $J$ and the coefficients of basis functions
$\theta = (\theta_1, \ldots, \theta_J)^T$ as follows:

(A1) The prior for $J$ satisfies $\Pi(J > j) \le A(j)$ and
$\Pi(j < J < C_0 j) \ge B(j)$ when $j$ is sufficiently large, for some
constant $C_0 > 1$. The functions $A(j)$ and $B(j)$ are assumed to be
nonnegative and strictly decreasing to 0 when $j \to \infty$.

(A2) Given $J$, we consider a $J$-dimensional joint distribution $G_1$ as the
prior for $\theta = (\theta_1, \ldots, \theta_J)^T$. Assume $G_1$ satisfies
$G_1(\|\theta - \theta_0\|_2 \le \epsilon) \ge \exp\{-C_2 J \log(1/\epsilon)\}$
for some $\theta_0 \in \mathbb{R}^J$, constant $C_2 > 0$ and sufficiently small
$\epsilon > 0$.

The priors on $J$ and $\theta$ are allowed to depend on $n$ provided the
constants appearing in (A1) and (A2) are free of $n$. However, to simplify
notation, we shall drop the subscript $n$ from $\Pi$. In the following, with
some abuse of notation, we will use $\Pi$ for the prior distribution on $J$
as well as for the induced prior distribution on functions $\theta^T \xi$.
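
To make the two-stage construction concrete, the sketch below draws one
function from a random series prior: first the number of terms, then the
coefficients given that number. The cosine basis on [0, 1], the geometric
prior on the number of terms, the independent normal coefficients, and all
function and parameter names are assumptions chosen for illustration; the
general setup above allows other bases and coefficient distributions.

```python
import numpy as np

def cosine_basis(x, J):
    """First J cosine basis functions on [0, 1]; returns an array of shape (len(x), J)."""
    j = np.arange(J)
    B = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))
    B[:, 0] = 1.0                      # the constant function is the first basis element
    return B

def sample_random_series(x, rng, p_geom=0.2, scale=1.0):
    """Draw (J, theta, w) with w = sum_j theta_j * basis_j evaluated on the grid x."""
    J = int(rng.geometric(p_geom))     # hypothetical prior on J with support {1, 2, ...}
    theta = rng.normal(0.0, scale, size=J)   # hypothetical prior on coefficients given J
    w = cosine_basis(x, J) @ theta
    return J, theta, w

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 201)
J, theta, w = sample_random_series(x, rng)
print(J, w[:3])
```

Repeated calls give functions of varying complexity, since a larger draw of
the number of terms adds higher-frequency components.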

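We define two metrics $d_1$ and $d_2$ on $\Omega$ satisfying the following
conditions:

$d_1(\theta_1^T \xi, \theta_2^T \xi) \le a(j)\, \|\theta_1 - \theta_2\|_2$,  (3.1)
$d_2(\theta_1^T \xi, \theta_2^T \xi) \le \|\theta_1 - \theta_2\|_2 / b(j)$  (3.2)

for some positive functions $a(\cdot)$ and $b(\cdot)$ and every $j \in \mathbb{N}$,
$\theta_1, \theta_2 \in \mathbb{R}^j$.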

Now we state the main theorem, which can be regarded as a master theorem
where the proofs of the required conditions for posterior convergence rates
for various inference problems are established similarly to Theorem 2.1 of
van der Vaart and van Zanten [2008] and Theorem 3.1 of van der Vaart and
van Zanten [2009].



Theorem 1. Let e„ > e„ be two sequence of positive numbers satisfying e„ — )■ 
and ne^ > 1 as n 00. For a function wq, let there exist two series of positive 
numbers Jn and Mn, a strictly decreasing, nonnegative function e(-) and a 
Oqj G W for any j G N, such that 

\\0o,j\\oo < Mn, (3.3) 

d2{wo,el^^)<e{j), (3.4) 

max{e-^ (en), nel} < Jn, (3.5) 

AiJn) < exp{-4ne2}, (3.6) 

log{l/5(J„)}< J„log(l/e„), (3.7) 

Jn log Jn + Jn log < nel (3.8) 

n(6> ^ [-M„,M„p) <exp{-Anel},l<j < Jn, (3.9) 

Let yVj„^M„ = {w = 6^^ : 6 G < Jn, ll^lloo < Mn}- Then the following 
assertions hold: 

logL'(e„,Wj„,M„,rfi) < nel, (3-10) 
U{W^Wj^,mJ < exp{-4n6t}, (3.11) 

-\ogU{w = 0^^:d2{wo,w)<2en} < J„log(— ^). (3.12) 
Proof. We first verify (3.10). Using the definition of the packing number,
the assumptions on $M_n$, $J_n$ and (3.1), we obtain:

log D{en,yVj„,M„,di) 

J n 

< log Z}(e„/a(j), [e e n„ ||6>|U < MJ, || ■ l^)) 

< log J„{ } 

< J„log )<ne^ (3.13) 



e 

Next, to verify (3.11), observe the following:

iiiw^Wj„,M„) < n(j> j„) + 5^n(6i^ [-M„,M„nn„(j = j) 

< A{Jn) + exp{-Anel} 

< exp{-Anel}. (3.14) 

For fl3.12p . using (13. 3p and (13. 4p . since d2{wQ, Oqj^) < e{j) < e„ for all j > J„, 
we have 

Ii{w : d2iwo, 61^0 < 2en} > n(J„ < J < Ci J„)Gi(||6> - 6>o||2 < &(-^n)en) 

> 5(J„)exp{-C2J„log(— ^)} (3.15) 

for some constants $C_1 > 1$ and $C_2 > 0$. By taking a negative logarithm on
both sides and using (3.7), it follows that (3.12) holds. Hence the proof is
complete. □

Remark 1. The choice of $M_n$ is closely related to the prior distribution
$G_1$ on $\theta$. For example, if we restrict the prior on a bounded set, then
$M_n$ can be chosen as a large constant such that the left hand side of (3.9)
equals 0. If we assume the exponential decay condition
$G_1(\theta \notin [-M, M]^j) \le \exp\{-c_3 j M^{c_4}\}$ for some constants
$c_3, c_4 > 0$, all $j \in \mathbb{N}$ and sufficiently large $M$, then $M_n$ can
be chosen polynomial in $n$.

Remark 2. Conditions (3.4), (3.5), (3.6) and (3.7) in Theorem 1 require a
sufficiently large $J_n$ in order to have a sufficiently good approximation to
$w_0$. In other words, $J_n$ controls the bias of the model. Meanwhile,
Conditions (3.8) and (3.9) state that $J_n$ should not be too large if the
complexity of the model is to be controlled. When studying Bayesian
asymptotic properties, a balance between bias and complexity needs to be
established to obtain the optimal posterior convergence rate.

Next, we give some examples to illustrate the use of Theorem 1.

Example 1. Fourier trigonometric series 

Choose the basis $\{\cos jx, \sin jx, j \in \mathbb{N}\}$; then for a function
$w_0 \in C^\alpha$, we have $e(J) = C(\log J)/J^\alpha$ for some constant $C$ and
$d_2$ as the supremum distance [Jackson, 1930]. Therefore, we can choose
$J_n = n^{1/(2\alpha+1)} (\log n)^{2/(2\alpha+1)}$ and
$\bar\epsilon_n = n^{-\alpha/(2\alpha+1)} (\log n)^{1/(2\alpha+1)}$. The functions
$A(\cdot)$ and $B(\cdot)$ in the prior can be chosen as exponential functions
$A(x) = \exp\{-c_1 x\}$ and $B(x) = \exp\{-c_2 x \log x\}$ for some positive
constants $c_1$ and $c_2$. Then the rate $\epsilon_n$ is
$n^{-\alpha/(2\alpha+1)} (\log n)^{1/(2\alpha+1) + 1/2}$.

Example 2. Bernstein polynomials 

We consider the Bernstein polynomial prior proposed by Petrone [1999b,a].
Consider a continuously differentiable density function $w_0$ with bounded
second derivative; the approximation property of Bernstein polynomials to
$w_0$ is $e(J) = C/J$ for some universal constant $C$ and $d_2$ as the
supremum distance [Lorentz, 1953]. We can choose $J_n = n^{1/3}$,
$\bar\epsilon_n = n^{-1/3}$ and again the same choices for $A(\cdot)$ and
$B(\cdot)$ as in Example 1. The rate $\epsilon_n$ is $n^{-1/3} (\log n)^{1/2}$,
which has the same polynomial power as given in Ghosal [2001].



Example 3. Polynomial basis 

Consider the orthogonal polynomials as the approximation tool for
$w_0 \in C^\alpha([0, 1])$; we have $e(J) = C J^{-\alpha}$ for some universal
constant $C$ and $d_2$ as the $L_2$-distance (e.g., Theorem 6.1 of Hesthaven
et al. [2007]). Then we can choose $J_n = n^{1/(2\alpha+1)}$,
$\bar\epsilon_n = n^{-\alpha/(2\alpha+1)}$ and the same $A(\cdot)$ and $B(\cdot)$
as in Example 1. The rate $\epsilon_n$ will be $\bar\epsilon_n$ multiplied by
some power of $\log n$, where the power depends on the statistical problem.

Note that although the approximation is given for L2-distance instead of 
supremum distance, the adaptive rate can still be obtained under mild condi- 
tions, see Remarks [3 El El [13 and [H 

Example 4. B-splines 

Choose B-splines as the basis functions; then for $w_0 \in C^\alpha([0, 1])$, we
have $e(J) \asymp J^{-\alpha}$ for $d_2$ as the supremum distance. Then we can
choose $J_n = n^{1/(2\alpha+1)}$, $\bar\epsilon_n = n^{-\alpha/(2\alpha+1)}$ and
$\epsilon_n$ as $n^{-\alpha/(2\alpha+1)}$ multiplied by some power of $\log n$,
where the power depends on the statistical problem. More details are given in
Section 3.2.
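
As a quick numerical check of the B-spline choices in Example 4, the sketch
below computes the dimension and the rate without the logarithmic factor, and
confirms that the approximation error at that dimension matches the rate.
The function name and the sample values of the sample size and smoothness
are illustrative assumptions.

```python
def spline_dimension_and_rate(n, alpha):
    """Return (J_n, eps_bar_n) with J_n = n^{1/(2a+1)} and eps_bar_n = n^{-a/(2a+1)}."""
    J_n = n ** (1.0 / (2.0 * alpha + 1.0))
    eps_bar = n ** (-alpha / (2.0 * alpha + 1.0))
    return J_n, eps_bar

n, alpha = 10_000, 2.0                 # hypothetical sample size and smoothness level
J_n, eps_bar = spline_dimension_and_rate(n, alpha)

# The approximation error J^{-alpha} evaluated at J_n equals eps_bar_n,
# and n * eps_bar_n^2 equals J_n, so bias and complexity are balanced.
print(J_n, eps_bar, J_n ** (-alpha), n * eps_bar ** 2)
```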



Example 5. Wavelets 

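We consider a multiresolution wavelet expansion:

$w(x) = \sum_{k \in \mathbb{Z}} \alpha_k \phi_k(x) + \sum_{j=0}^{j_1} \sum_{k \in \mathbb{Z}} \beta_{jk} \psi_{jk}(x)$,  (3.16)

where $\phi$ are father wavelets, $\psi$ are mother wavelets and
$j_1 \in \mathbb{N}$. We put priors on $j_1$ and the wavelet coefficients $\alpha$
and $\beta$. It has been shown that, for $w_0 \in C^\alpha$, the approximation
error is $e(j_1) = 2^{-j_1 \alpha}$ for $d_2$ as the supremum distance [e.g.,
Giné and Nickl, 2009]. Meanwhile, if $w_0$ has a compact support, the number
of nonzero coefficients is $O(2^{j_1})$. Hence we apply Theorem 1 for
$J = 2^{j_1}$ and choose $J_n = n^{1/(2\alpha+1)}$,
$\bar\epsilon_n = n^{-\alpha/(2\alpha+1)}$ and the same $A(\cdot)$ and $B(\cdot)$
as in Example 1. The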



results for white noise models in 


Lian 


2011 


regression models in 


Rivoirard and Rousseau 



and for density estimation and 



3.2 B-splines 



We consider the B-spline model as described in de Boor [2001]. For a given
compact interval on the real line, we divide it into $K$ equally spaced
subintervals and then define a class of B-spline basis functions that are
$q$ times differentiable. Then it can be shown that these spline functions
form a $J$-dimensional linear space, where $J = q + K - 1$. In our study,
these B-spline functions are used to approximate the underlying data
generation scheme (e.g., true densities). The approximation ability of the
B-spline functions is determined by the smoothness level $\alpha$ of the data
and the number of spline basis functions $J$, under the condition that
$q \ge \alpha$. Denote by $B$ the column vector of B-spline basis functions.
In the following discussion, we assume $q$ is chosen large enough such that
$q \ge \alpha$ holds for every statistical problem we are interested in.
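
The basis described above can be evaluated with the Cox-de Boor recursion.
The sketch below is a minimal illustration on [0, 1], assuming that $q$
denotes the spline order (piecewise polynomials of degree $q - 1$) and that
the boundary knots are repeated $q$ times; both are assumptions about the
indexing convention, chosen so that the number of basis functions equals
$q + K - 1$. The function name is hypothetical.

```python
import numpy as np

def bspline_basis(x, K, q):
    """Evaluate the q + K - 1 B-splines of order q on K equal subintervals of [0, 1]."""
    x = np.asarray(x, dtype=float)
    knots = np.concatenate([np.zeros(q), np.arange(1, K) / K, np.ones(q)])
    # Order 1: indicators of the half-open knot spans [t_i, t_{i+1}).
    B = np.zeros((x.size, len(knots) - 1))
    for i in range(len(knots) - 1):
        B[:, i] = (knots[i] <= x) & (x < knots[i + 1])
    B[x == 1.0, q + K - 2] = 1.0       # assign the right endpoint to the last nonempty span
    # Raise the order step by step with the Cox-de Boor recursion.
    for k in range(2, q + 1):
        Bn = np.zeros((x.size, len(knots) - k))
        for i in range(len(knots) - k):
            left = knots[i + k - 1] - knots[i]
            right = knots[i + k] - knots[i + 1]
            if left > 0:
                Bn[:, i] += (x - knots[i]) / left * B[:, i]
            if right > 0:
                Bn[:, i] += (knots[i + k] - x) / right * B[:, i + 1]
        B = Bn
    return B

x = np.linspace(0.0, 1.0, 101)
B = bspline_basis(x, K=5, q=4)
print(B.shape, B.sum(axis=1)[:3])
```

The rows of the returned matrix are nonnegative and sum to one, which
reflects the non-negativity and summing-to-one properties of B-splines
mentioned in the Introduction.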

Lemma 1. (1) For any function $f \in C^\alpha([0, 1])$, $0 < \alpha \le q$, there
exist $\theta \in \mathbb{R}^J$ and a constant $C > 0$ such that

$\|f - \theta^T B\|_\infty \le C J^{-\alpha} \|f^{(\alpha)}\|_\infty$.  (3.17)

(2) Further, if $f > 0$ and $J > J_0$, where $J_0$ is a sufficiently large
constant that only depends on $f$ and $q$, then every element of $\theta$ can
be chosen to be positive.

Proof. The first part is a well-known spline approximation result [de Boor,
2001]. For the second assertion, assume $f \ge \epsilon$ for some
$\epsilon > 0$. Using the result in de Boor [2001, Chap. 6], for each
$\theta_i$ there exists a universal constant $C_1$ that depends only on $q$
and $f$, such that
$|\theta_i - c| \le C_1 \sup_{x \in [t_{i+1}, t_{i+q-1}]} |f(x) - c|$ for any
choice of constant $c$. Here $t_{i+1}$ and $t_{i+q-1}$ are the $(i+1)$-th and
$(i+q-1)$-th knots. Choose $c = \inf_{x \in [t_{i+1}, t_{i+q-1}]} f(x) \ge \epsilon$;
then the right hand side is bounded by $C_1 (q/J)^\alpha$. Choosing
$J > q (C_1/\epsilon)^{1/\alpha}$, we have
$\theta_i \ge c - C_1 (q/J)^\alpha > 0$. □

Remark 3. In part (2), the condition $f > 0$ is crucial. If we approximate a
nonnegative function $f$ using nonnegative coefficients $\theta$, then the
approximation error is only $O(J^{-1})$ [de Boor and Daniel, 1974], which
does not adapt to smoothness levels beyond 1.

For $w_0 \in C^\alpha$, we have $e(J) = J^{-\alpha}$ for $d_2$ chosen as the
supremum distance. This leads to the following choice of priors for $J$ and
the coefficients of basis functions $\theta = (\theta_1, \ldots, \theta_J)^T$:

(B1) The prior for $J$ satisfies $\Pi(J > j) \le \exp\{-c_0 j \log^{t_1} j\}$
and $\Pi(j < J < C_0 j) \ge \exp\{-c_1 j \log^{t_2} j\}$ for some constants
$c_0, c_1 > 0$, $0 \le t_1 \le t_2 \le 1$ and $C_0 > 1$ when $j$ is
sufficiently large.

(B2) Given $J$, we consider a $J$-dimensional joint distribution $G_2$ as the
prior for $\theta = (\theta_1, \ldots, \theta_J)^T$. Assume $G_2$ satisfies
$G_2(\theta : \|\theta - \theta_0\|_2 \le \epsilon) \ge \exp\{-C_2 J \log(1/\epsilon)\}$
for some $\theta_0 \in \mathbb{R}^J$, constant $C_2 > 0$ and small $\epsilon$.

(B3) In particular, if $w_0 > 0$, then we allow the prior for $\theta$ to be
constructed on $(0, \infty)^J$ satisfying
$\Pi(\theta : \|\theta - \theta_0\|_2 \le \epsilon) \ge \exp\{-C_2 J \log(1/\epsilon)\}$.

Remark 4. We give some examples of such priors. Geometric, Poisson and
negative binomial distributions satisfy Condition (B1). Normal, gamma,
exponential and Dirichlet distributions satisfy Condition (B2); see Lemma 6.1
of Ghosal et al. [2000] for the last conclusion.



Using the relation ||6»f 5 - 6>|^5||i < - ^ab/v^ and \\0jB-0^B\\^ < 
\\0i — 02\\2/ y/j for 01,02 £ I^"'; we get the following conclusion from Theorem 
1. 

Theorem 2. Let e„ > e„ be two sequence of positive numbers satisfying e„ — )■ 
and ne^ > 1 as n —t- 00. // there exist Jn and Mn satisfying the following 
conditions:



max{e;'/",n4} < J„, (3.18) 

cilog*! J„ < log(l/e„), (3.19) 

J„log(^^) <ne2, (3.20) 

n„(6>^ [-M„,M„]^) <exp{-4ne2},l < J< J„, (3.21) 

Let >Vj„,M„ = {tf^ = ^"^S : eR-^,J < Jn, ll^lloo < then the following 

assertions hold: 

logD(e„,>Vj„,M„,||-||i) < nel (3.22) 

II{W^Wj^,mJ < exp{-4net}, (3.23) 

-logn{||w7o-6>^B|U<2e„} < JJog{-^). (3.24) 



4 Gaussian white noise model 



We consider a Gaussian white noise model

$dX(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW(t), \quad 0 \le t \le 1$,  (4.1)

where $X(t)$ is the observed signal, $f(t)$ is the unknown signal and $W(t)$
is a standard Wiener process. Let $\phi_i$, $i = 1, 2, \ldots$, be an
orthonormal basis in $L_2[0, 1]$. Assume $f \in L_2[0, 1]$; then this problem
can be transformed into the estimation of the mean
$\theta = (\theta_1, \theta_2, \ldots)$ of an infinite-dimensional normal
distribution as follows:

$X_i = \theta_i + \frac{\epsilon_i}{\sqrt{n}}, \quad i = 1, 2, \ldots$  (4.2)

Here the $X_i$'s are independent observations and the $\epsilon_i$'s are
i.i.d. white noise variables that follow the normal distribution $N(0, 1)$.
Asymptotic results have been obtained in frequentist studies. For example, in
Pinsker [1980], the minimax convergence rate is shown to be
$n^{-\alpha/(2\alpha+1)}$ for an $L_2$-class with respect to quadratic risk.
In a Bayesian study, Belitser and Ghosal [2003] obtained the same posterior
convergence rate for a discrete collection of smoothness parameters $\alpha$.
Babenko and Belitser [2010] considered putting a prior on the oracle $J$,
which is defined as the best cut-off $\theta_i = 0$ for all $i > J$ such that
the risk of $\{X_i \mathbb{1}(i \le J) : i \in \mathbb{N}\}$ is minimized. They
showed that such an oracle estimator is minimax for a general smoothness
class and hence obtained adaptation results as well.
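
The sequence model (4.2) and the oracle cut-off can be illustrated
numerically. The sketch below simulates one data set and computes the exact
quadratic risk of the truncation estimator for every cut-off, which is the
variance term (cut-off divided by sample size) plus the squared tail of the
mean. The polynomially decaying mean sequence, the truncation of the
infinite sequence at a finite length, and all names are illustrative
assumptions.

```python
import numpy as np

def truncation_risk(theta, n):
    """Exact risk of the estimator X_i * 1(i <= J), for J = 0, 1, ..., len(theta)."""
    sq = theta ** 2
    # tail[J] = sum of theta_i^2 over i > J (1-based indexing of theta).
    tail = np.concatenate([np.cumsum(sq[::-1])[::-1], [0.0]])
    J = np.arange(theta.size + 1)
    return J / n + tail

alpha, n, N = 1.5, 1000, 5000               # hypothetical smoothness, sample size, sequence length
i = np.arange(1, N + 1)
theta = i ** (-(alpha + 1.0))               # hypothetical mean with sum_i i^{2 alpha} theta_i^2 finite

rng = np.random.default_rng(0)
X = theta + rng.standard_normal(N) / np.sqrt(n)   # one draw from model (4.2)

risk = truncation_risk(theta, n)
J_oracle = int(np.argmin(risk))             # cut-off minimizing the risk
theta_hat = np.where(i <= J_oracle, X, 0.0) # oracle truncation estimator
print(J_oracle, risk[J_oracle])
```

The oracle keeps coordinate $i$ exactly when its squared mean exceeds the
noise variance $1/n$, which is why the cut-off grows with the sample size.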

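Assume the parameter $\theta = (\theta_1, \theta_2, \ldots)$ belongs to a
Sobolev-type class
$S(\alpha) = \{\theta : \sum_{i=1}^\infty i^{2\alpha} \theta_i^2 \le Q\}$ for some
$\alpha, Q > 0$. Here $\alpha$ can be viewed as a smoothness parameter. We
consider the oracle $J > 0$ such that $\theta_i = 0$ for all $i > J$. The
approximation error is

$\sum_{i=J+1}^\infty \theta_i^2 \le J^{-2\alpha} \sum_{i=J+1}^\infty i^{2\alpha} \theta_i^2 \le Q J^{-2\alpha}$.

Then by constructing appropriate priors on $\theta$ and $J$ as in Section 3.1,
using Theorem 1 for $J_n = n^{1/(2\alpha+1)}$,
$\bar\epsilon_n = n^{-\alpha/(2\alpha+1)}$ and $d_2$ as the $\ell_2$-distance, we
obtain the following theorem: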

Theorem 3. Suppose the true mean $\theta_0 \in S(\alpha)$ for some
$0 < \alpha < q$. If the prior is constructed as in Section 3.1, then the
posterior distribution of $\theta$ converges at rate
$\epsilon_n = n^{-\alpha/(2\alpha+1)} (\log n)^{1/2}$ with respect to the
$\ell_2$-distance.



5 Density estimation 



5.1 Density on a known compact interval 

In this section, we apply the general results to density estimation problems.
In the frequentist study, the optimal rate of convergence has been obtained
for the maximum likelihood estimators in Stone [1990]. A Bayesian log-spline
model has been studied in Ghosal et al. [2000], where the optimal posterior
convergence rate is shown to be $n^{-\alpha/(2\alpha+1)}$. When $\alpha$ is
unknown, the rate $n^{-\alpha/(2\alpha+1)}$ up to an additional logarithmic
factor is established in Ghosal et al. [2008].



We first consider estimating a density function $f_0$ that is defined on the
unit interval $[0, 1]$. A Bayesian estimator of $f$ can be constructed by
using B-spline functions through a nonnegative, monotonic link function $g$,
i.e.,
$p_\theta = g(\sum_{i=1}^J \theta_i B_i) / \int_0^1 g(\sum_{i=1}^J \theta_i B_i)$
for $\theta \in \mathbb{R}^J$, and $J$ is given a prior on $\mathbb{N}$. We can
choose $g$ as the identity function and restrict the prior for $\theta$ to
$(0, \infty)^J$. If we choose $g$ as the exponential function, then it gives
the log-spline model. If $g$ is a polynomial, $p_\theta$ is a rational
function of $\theta$. By using the identity link function, it is possible to
avoid the normalization altogether; see Section 5.3.
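
The link-function construction can be sketched numerically as follows. For
brevity the example uses linear B-splines (hat functions) on equally spaced
knots and a trapezoidal rule for the normalizing integral; the basis order,
the grid, the coefficient values and all names are illustrative assumptions
rather than choices made in the paper.

```python
import numpy as np

def hat_basis(x, K):
    """Linear B-splines on K equal subintervals of [0, 1]; returns K + 1 columns."""
    knots = np.arange(K + 1) / K
    return np.clip(1.0 - K * np.abs(x[:, None] - knots[None, :]), 0.0, None)

def link_density(theta, x, K, g=np.exp):
    """Evaluate p_theta = g(theta^T B) / int g(theta^T B) on the grid x."""
    u = g(hat_basis(x, K) @ theta)
    norm = np.sum(0.5 * (u[1:] + u[:-1]) * np.diff(x))   # trapezoidal normalizing constant
    return u / norm

x = np.linspace(0.0, 1.0, 2001)
rng = np.random.default_rng(1)
theta = rng.normal(size=6)                               # K = 5 gives J = 6 coefficients

p_exp = link_density(theta, x, K=5)                      # exponential link: log-spline model
p_id = link_density(np.abs(theta) + 0.1, x, K=5, g=lambda t: t)   # identity link, positive coefficients
print(p_exp[:3], p_id[:3])
```

With the exponential link any real coefficient vector gives a valid density,
whereas the identity link requires positive coefficients, matching the
restriction of the prior to the positive orthant described above.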

The theorem on the posterior convergence rate is stated as follows:

Theorem 4. Consider $n$ independent, identically distributed observations
$X_1, \ldots, X_n$ from the true density $f_0 \in C^\alpha[0, 1]$ with
$\alpha \le q$. Suppose that $f_0$ is positive. If the prior is constructed as
in Section 3.2, then the posterior distribution converges at rate
$\epsilon_n = n^{-\alpha/(2\alpha+1)} (\log n)^{1/2}$ with respect to the
Hellinger or the $L_2$-distance.

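Proof. From the spline approximation theory, given $J$, there exists
$\theta_0 \in \mathbb{R}^J$ such that
$\|\theta_0^T B - g^{-1}(f_0)\|_\infty \lesssim J^{-\alpha}$. This gives
$\|f_0 - g(\theta_0^T B)\|_\infty \lesssim J^{-\alpha}$. By integration, we have
$|1 - \int_0^1 g(\theta_0^T B)(x)\,dx| \lesssim J^{-\alpha}$. Hence
$\|f_0 - p_{\theta_0}\|_\infty \lesssim J^{-\alpha}$ and
$h(f_0, p_{\theta_0}) \lesssim J^{-\alpha}$. Using Lemma 8 of Ghosal and van der
Vaart [2007a], we have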



Kifo,pe,)<2h''{fo,pe,] 
Vifo,P0o)<h'ifo,Peo)(l + 



fo 



fi 







Peo 



< J 



<J 



-2a 



-2a 



{5.1] 



Now we use Lemma 7.4 in Ghosal et al. [2008], which states that the Hellinger
distance on $p_\theta$, $\theta \in \mathbb{R}^J$, is equivalent to the
$L_2$-distance on $\theta_1, \theta_2 \in \mathbb{R}^J$ divided by $\sqrt{J}$:



inf 

x,e 



1^1 — 62 
J 



A 1) < h^{pe^,Pe2) ^ sup 



\6i — O2 



A 1 



(5.2) 



Hence we can apply Lemma 7.6 in that paper and calculate the prior prob- 
ability on an L2-ball around 6q instead of a KL-ball around /q. We apply The- 
orem[2]for J„ x n^l^'^^+^) ^ M„ as a constant satisfying M„ > max{(yf~^(/o), /o}, 
e„ = ^-«/(2a+i) ^ ^-a/(2a+i)Qog^)i/2^ ^Q^g ^Yi&i (13:21) and ([EI]) to- 

gether imply H(/C(fn,e„) ) > expj— cne^j for some positive constant c. Then 
we apply Theorem 2.1 of Ghosal et al.l |2000 |. the proof is complete. □ 



Remark 5. If we assume $\alpha \ge 1$, namely, $f_0$ is Lipschitz continuous,
then the proof above can proceed using $L_2$-approximation only. From
$\|\theta_0^T B - g^{-1}(f_0)\|_2 \lesssim J^{-\alpha}$, we have
$\|f_0 - g(\theta_0^T B)\|_2 \lesssim J^{-\alpha}$. Hence
$\|f_0 - p_{\theta_0}\|_2 \lesssim 2 \|f_0 - g(\theta_0^T B)\|_2 \lesssim J^{-\alpha}$.
Then $h(f_0, p_{\theta_0}) \lesssim \|f_0 - p_{\theta_0}\|_2 \lesssim J^{-\alpha}$
since $f_0$ is bounded below. The KL divergences can be bounded similarly.
This implies that we can use other basis expansions, such as orthogonal
polynomials, instead of B-spline functions as long as we have
$L_2$-approximation results.






5.2 Density with unbounded support 

Let $f_0 \in C^\alpha(\mathbb{R})$. Consider a fixed monotonic link function
$\Psi : \mathbb{R} \to [0, 1]$ to change the domain of the function to $[0, 1]$.
Suppose that there is an interval $[a, b]$ such that $f_0$ is bounded away
from 0 on it. Consider a pseudo-metric
$d(f_1, f_2) = \int_a^b |f_1(y) - f_2(y)|\,dy$. By constructing a prior on $f$
through the representation $f(y) = g\{\theta^T B(\Psi(y))\}$ and arguing as in
Theorem 4, we obtain the same posterior convergence rates with respect to
$d$. The same method also applies for half open intervals.



5.3 Computation 



Consider the setting of Subsection 5.1. It is well known that [see Schumaker,
2007, Chap. 4]



Bi{x)dx=l « = g,...,(J-g + l); (5.3) 

[ TOi) ^ = (^-^ + 2),..., J. 

Define scaled B-spline basis functions $B_i^* = B_i / \int_0^1 B_i$,
$i = 1, \ldots, J$, so that $\int_0^1 B_i^*(x)\,dx = 1$, $i = 1, \ldots, J$.

We restrict the coefficients to satisfy $\sum_{k=1}^J \theta_k = 1$ and form a
density $f = \sum_{k=1}^J \theta_k B_k^*$. We put a Dirichlet prior
$\theta \sim \mathrm{Dir}(a_1, a_2, \ldots, a_J)$ given fixed $J$. Finally, we
assign a prior $\Pi$ on $J$. Thus a prior on the density $f$ is induced.
Given the observations $X_1, \ldots, X_n$ and a fixed dimension $J$, the
posterior density of $\theta$ is a mixture of Dirichlet distributions:



fc=i i=i k=i 

J J J n 



«l = l i„=l k=l s=l 

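Thus, the posterior mean of $f$ at a point $x$ is

$$\frac{\sum_{j=1}^{\infty} \Pi(j) \sum_{i_0=1}^{j} \sum_{i_1=1}^{j} \cdots \sum_{i_n=1}^{j} \int_{\theta \in \Delta_j} \prod_{k=1}^{j} \theta_k^{a_k-1} \prod_{s=0}^{n} \theta_{i_s} B_{i_s}^*(X_s)\, d\theta}{\sum_{j=1}^{\infty} \Pi(j) \sum_{i_1=1}^{j} \cdots \sum_{i_n=1}^{j} \int_{\theta \in \Delta_j} \prod_{k=1}^{j} \theta_k^{a_k-1} \prod_{s=1}^{n} \theta_{i_s} B_{i_s}^*(X_s)\, d\theta}, \qquad (5.5)$$

where $X_0$ stands for $x$ and $\Delta_j$ is the unit simplex in $\mathbb{R}^j$. Define $I_{k,j,0} = \sum_{s=0}^{n} 1\{i_s = k\}$ and $I_{k,j,1} = \sum_{s=1}^{n} 1\{i_s = k\}$. The quantity above can be simplified into

$$\frac{\sum_{j=1}^{\infty} \Pi(j) \sum_{i_0=1}^{j} \sum_{i_1=1}^{j} \cdots \sum_{i_n=1}^{j} \prod_{k=1}^{j} \Gamma(a_k + I_{k,j,0}) \prod_{s=0}^{n} B_{i_s}^*(X_s) \big/ \Gamma\bigl(\sum_{i=1}^{j} a_i + n + 1\bigr)}{\sum_{j=1}^{\infty} \Pi(j) \sum_{i_1=1}^{j} \cdots \sum_{i_n=1}^{j} \prod_{k=1}^{j} \Gamma(a_k + I_{k,j,1}) \prod_{s=1}^{n} B_{i_s}^*(X_s) \big/ \Gamma\bigl(\sum_{i=1}^{j} a_i + n\bigr)}. \qquad (5.6)$$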

A basis function takes nonzero values only on $q$ intervals, so the calculation involves a multiple of $q^{n+1}$ loops. More details are given in Section 9. Similar expressions can be obtained for posterior moments, in particular, the posterior variance.
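For reference, the simplification from (5.5) to (5.6) rests on the Dirichlet integral. As a sketch, write $\Delta_j$ for the unit simplex in $\mathbb{R}^j$ and let $c_k$ stand for $a_k$ plus the number of indices $i_s$ equal to $k$ (notation introduced here only for illustration):

```latex
\int_{\Delta_j} \prod_{k=1}^{j} \theta_k^{c_k - 1}\, d\theta
  = \frac{\prod_{k=1}^{j} \Gamma(c_k)}{\Gamma\bigl(\sum_{k=1}^{j} c_k\bigr)},
\qquad c_k > 0 .
```

Each index configuration therefore contributes a ratio of Gamma functions times a product of basis-function values, so no numerical integration over $\theta$ is required.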

Remark 6. Note that the adaptive Bayesian estimator has a connection with histogram and kernel methods. Namely, if $q = 1$, the sums over indices $i_1, \dots, i_n$ in (5.6) collapse to a single term. Hence it gives a histogram estimate of densities, which is similar to the Bayes estimator in Gasparini [1996]. Although doing so makes computation easy, it cannot adapt to smoothness levels greater than 1. Our estimator can also be viewed as a kernel estimator, where the kernel is induced by a discrete parameter and takes positive values only on a finite interval. When $q$ and $J$ are chosen larger, the kernel becomes flatter.



6 Whittle estimation of spectral density 
6.1 Posterior convergence rates 

Consider a second order stationary time series $\{X_t, t \in \mathbb{Z}\}$ with mean 0 and autocovariance function $\gamma_r = E(X_t X_{t+r})$. Assume $\sum_r |\gamma_r| < \infty$. Define the spectral density of $\{X_t\}$ by $f(\lambda) = (2\pi)^{-1} \sum_{r=-\infty}^{\infty} \gamma_r e^{-ir\pi\lambda}$ on $[0, 1]$. Let $I_n(\lambda) = (2\pi n)^{-1} |\sum_{t=1}^{n} X_t e^{-it\pi\lambda}|^2$ be the periodogram. Instead of using the true likelihood, which is complicated even for a Gaussian time series, Whittle [1957] proposed using an approximate likelihood. Let $\nu = \lfloor n/2 \rfloor$ and $\omega_j = 2j/n$, $j = 1, \dots, \nu$. The Whittle likelihood is formed by pretending that $U_j = I_n(\omega_j)$, $j = 1, \dots, \nu$, are independent and exponentially distributed with means $f(\omega_j)$. The advantage of using the Whittle likelihood is that it is an explicit function of the spectral density $f$, rather than being expressed in terms of infinitely many autocorrelation coefficients.
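A minimal sketch of these two ingredients follows, assuming the periodogram normalization $(2\pi n)^{-1}$ and the frequencies $\omega_j = 2j/n$ as read from the text above. The function names are hypothetical, and the direct $O(n^2)$ sum is used only to keep the example dependency-free.

```python
import cmath
import math


def periodogram(x):
    """Return (omega_j, U_j) with omega_j = 2j/n, j = 1..floor(n/2), and U_j = I_n(omega_j)."""
    n = len(x)
    nu = n // 2
    omegas = [2.0 * j / n for j in range(1, nu + 1)]
    values = []
    for w in omegas:
        # I_n(w) = |sum_t X_t exp(-i t pi w)|^2 / (2 pi n), with t = 1, ..., n
        s = sum(x[t - 1] * cmath.exp(-1j * t * math.pi * w) for t in range(1, n + 1))
        values.append(abs(s) ** 2 / (2.0 * math.pi * n))
    return omegas, values


def whittle_loglik(f, omegas, values):
    """Whittle log-likelihood: sum_j [-log f(omega_j) - U_j / f(omega_j)]."""
    return sum(-math.log(f(w)) - u / f(w) for w, u in zip(omegas, values))
```

Any candidate spectral density, for instance one built from a spline expansion, can be passed as the callable `f`; the log-likelihood depends on it only through its values at the $\nu$ frequencies.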
A nonparametric Bayesian method has been used by Choudhuri et al. [2004a], where the prior is constructed via Bernstein polynomials. They also gave a posterior consistency result. Posterior convergence rates can be obtained by a contiguity result as established in Choudhuri et al. [2004b]. Their results state that for a Gaussian time series, the distributions of $(U_1, \dots, U_\nu)$ given by the exact and the Whittle likelihood are contiguous. As far as asymptotic convergence results are concerned, this allows us to work under the pretended assumption that $U_1, \dots, U_\nu$ are actually independent and exponentially distributed.

Assume that the true spectral density $f_0 \in C^\alpha[0, 1]$ is positive throughout. A prior is constructed on $f$ using a monotonic link function $g$. For example, if $g$ is the identity function, then the prior for $\theta$ can be constructed on $(0, \infty)^J$ and $f = f_\theta$ is linear in $\theta$. If $g(x) = \log x$, then the prior for $\theta$ is constructed on $\mathbb{R}^J$. The posterior rates will be the same in both cases because the sieve in Theorem 2 will not be affected as long as $M_n$ is bounded by a polynomial of $n$. Define a discretized $L_2$-distance as follows:



$$d_n^2(f_1, f_2) = \frac{1}{\nu} \sum_{i=1}^{\nu} \{f_1(\omega_i) - f_2(\omega_i)\}^2. \qquad (6.1)$$



Then we have the following theorem. 



Theorem 5. Assume that the true spectral density $f_0 \in C^\alpha([0, 1])$ takes values in a bounded interval $[m, M]$ and $\alpha \le q$. Suppose that the prior is constructed as in Section 3.2. Then the posterior distribution converges at rate $\epsilon_n = n^{-\alpha/(2\alpha+1)}(\log n)^{1/2}$ with respect to $d_n$. If $\alpha \ge 1$, then the result also holds for the $L_2$-distance.
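To give a feel for the magnitudes involved, the following sketch simply evaluates the rate formula $n^{-\alpha/(2\alpha+1)}(\log n)^{1/2}$ and the order $n^{1/(2\alpha+1)}$ at which the number of basis functions grows. The function names are hypothetical and constants are ignored, so the numbers indicate orders of magnitude only.

```python
import math


def contraction_rate(n, alpha):
    """epsilon_n = n^(-alpha / (2 alpha + 1)) * (log n)^(1/2)."""
    return n ** (-alpha / (2.0 * alpha + 1.0)) * math.sqrt(math.log(n))


def basis_order(n, alpha):
    """n^(1 / (2 alpha + 1)): growth order of the number of basis functions, up to a constant."""
    return n ** (1.0 / (2.0 * alpha + 1.0))
```

For example, with $n = 1000$ and $\alpha = 1$ the rate is roughly 0.26 and about 10 basis functions are called for, while smoother truths (larger $\alpha$) give faster rates with fewer terms.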



Proof. Using results in Section 7.3 of Ghosal and van der Vaart [2007b],

$$\max\Bigl\{\nu^{-1} \sum_{i=1}^{\nu} K(P_{f_0,i}, P_{f,i}),\ \nu^{-1} \sum_{i=1}^{\nu} V(P_{f_0,i}, P_{f,i})\Bigr\} \lesssim d_n^2(f_0, f) \le \|f_0 - f\|_\infty^2. \qquad (6.2)$$

Then use the same arguments as in Theorem|H and apply Theorem 4 of Ghosal and van der Vaart [2007b]; the conclusion holds. If $\alpha \ge 1$, then $f_0$ is Lipschitz continuous. Hence $\|f_1 - f_2\|_2 \lesssim d_n(f_1, f_2) + (L + M)/n$ for any density functions $f_1, f_2 \le M$ having Lipschitz constant $L$. Therefore $d_n$ can be substituted by $L_2$. □



Remark 7. When $\alpha \ge 1$, observe $|f_1(2j/n) - f_2(2j/n)| \le |f_1(x) - f_2(x)| + 4L/n$ for $x \in [2(j-1)/n, 2j/n)$ and $j = 1, \dots, \nu$. By integrating and taking squares of both sides, we have $d_n^2(f_1, f_2) \lesssim \|f_1 - f_2\|_2^2 + L^2/n^2 + ML/n$. This implies that we only need $L_2$-approximation to get adaptive rates. Hence other basis expansion techniques that have an $L_2$-approximation property can also be used to construct the Bayesian estimator for this model.

6.2 Computation 

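Choosing the link function $g(x) = x^{-1}$, the likelihood is

$$\prod_{s=1}^{\nu} \frac{1}{f(\omega_s)} \exp\Bigl\{-\sum_{s=1}^{\nu} U_s / f(\omega_s)\Bigr\}. \qquad (6.3)$$

We consider independent gamma prior distributions on $\theta$: $\theta_i \sim \mathrm{Gamma}(a_i, b_i)$ for some positive numbers $a_i$ and $b_i$. The posterior mean of $g(f_0)$ is given by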

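$$\frac{\sum_{j=1}^{\infty} \Pi(j) \sum_{i_0=1}^{j} \cdots \sum_{i_\nu=1}^{j} \int_{\theta > 0} \prod_{k=1}^{j} \frac{b_k^{a_k} \theta_k^{a_k-1}}{\Gamma(a_k)} \prod_{s=0}^{\nu} \theta_{i_s} B_{i_s}(\omega_s) \exp\bigl\{-\sum_{s=1}^{\nu} \sum_{i=1}^{j} U_s \theta_i B_i(\omega_s) - \sum_{i=1}^{j} b_i \theta_i\bigr\}\, d\theta}{\sum_{j=1}^{\infty} \Pi(j) \sum_{i_1=1}^{j} \cdots \sum_{i_\nu=1}^{j} \int_{\theta > 0} \prod_{k=1}^{j} \frac{b_k^{a_k} \theta_k^{a_k-1}}{\Gamma(a_k)} \prod_{s=1}^{\nu} \theta_{i_s} B_{i_s}(\omega_s) \exp\bigl\{-\sum_{s=1}^{\nu} \sum_{i=1}^{j} U_s \theta_i B_i(\omega_s) - \sum_{i=1}^{j} b_i \theta_i\bigr\}\, d\theta},$$

where $\omega_0$ denotes the frequency at which the posterior mean is evaluated. Define $I_{k,j,0} = \sum_{s=0}^{\nu} 1\{i_s = k\}$ and $I_{k,j,1} = \sum_{s=1}^{\nu} 1\{i_s = k\}$. The quantity above can be simplified into

$$\frac{\sum_{j=1}^{\infty} \Pi(j) \sum_{i_0=1}^{j} \cdots \sum_{i_\nu=1}^{j} \prod_{s=0}^{\nu} B_{i_s}(\omega_s) \prod_{k=1}^{j} \dfrac{\Gamma(a_k + I_{k,j,0})\, b_k^{a_k}}{\Gamma(a_k) \{b_k + \sum_{s=1}^{\nu} U_s B_k(\omega_s)\}^{a_k + I_{k,j,0}}}}{\sum_{j=1}^{\infty} \Pi(j) \sum_{i_1=1}^{j} \cdots \sum_{i_\nu=1}^{j} \prod_{s=1}^{\nu} B_{i_s}(\omega_s) \prod_{k=1}^{j} \dfrac{\Gamma(a_k + I_{k,j,1})\, b_k^{a_k}}{\Gamma(a_k) \{b_k + \sum_{s=1}^{\nu} U_s B_k(\omega_s)\}^{a_k + I_{k,j,1}}}}.$$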



7 Nonparametric regression 
7.1 Regression with Gaussian errors 

We consider a nonparametric regression model with additive error $X_i = f(Z_i) + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$, and $\sigma$ is an unknown parameter. The covariates $Z_1, \dots, Z_n$ can be either fixed or random. We first treat the fixed covariates case. We will use the same spline-based prior on $f$ as in Section 3. Also, we put a prior distribution on $\sigma$ that satisfies a tail condition for sufficiently small $\sigma > 0$ and some positive number $c$:

G'3M<exp{-cexp(l/a)}. (7.1) 

In practice, this condition is automatically satisfied if $\sigma$ has a known lower bound. Define an empirical measure $P_n^Z = n^{-1} \sum_{i=1}^{n} \delta_{Z_i}$, the covariance matrix $\Sigma_{n,J}$, whose $(k, l)$-th element is $\int B_k B_l \, dP_n^Z$ for $k, l \le J$, and $\|\cdot\|_{2,n}$ as the norm on $L_2(P_n^Z)$.

Denote the true value of the regression function as $f_0$. We need the following assumptions on $f_0$ and $Z_1, \dots, Z_n$.

(D1) Smoothness: The function $f_0$ is assumed to be $\alpha$-Hölder, where $0 < \alpha \le q$.
(D2) Separation property:

$$\theta^T \Sigma_{n,J}\, \theta \lesssim J^{-1} \|\theta\|_2^2, \quad \theta \in \mathbb{R}^J. \qquad (7.2)$$

When the domain of the function is $\mathbb{R}$, define a pseudo-metric $\|\cdot\|_{2,n}^*$ by $(\|f_1 - f_2\|_{2,n}^*)^2 = \int_a^b |f_1 - f_2|^2 \, dP_n^Z$, similarly as in Section 5.2.

Theorem 6. Suppose the true regression function $f_0$ satisfies conditions (D1) and (D2), the covariates $Z_1, \dots, Z_n$ are fixed and a prior is constructed as in Section 3.2 and (7.1).

(1) If $f_0$ is defined on $[0, 1]$, then the posterior converges at the rate $\epsilon_n = n^{-\alpha/(2\alpha+1)} \log n$ relative to $\|\cdot\|_{2,n}$.

(2) If the support of $f_0$ is unbounded, e.g., the real line, then the posterior converges at the same rate with respect to $\|\cdot\|_{2,n}^*$.

Proof. We first consider part (1). Define $P_{f,i}$ as the normal measure with mean $f(Z_i)$ and variance $\sigma^2$. Results in Birgé [2006] imply that the likelihood ratio test for $f_0$ versus another $f$ satisfies the conditions in Lemma 2 of Ghosal and van der Vaart [2007b] with respect to $\|\cdot\|_{2,n}$; see the latter paper for details. This implies that we can work with $\|\cdot\|_{2,n}$ instead of the Hellinger distance. Using the arguments in Section 7.2 of Ghosal and van der Vaart [2007b], we get

$$\max\Bigl\{n^{-1} \sum_{i=1}^{n} K(P_{f_0,i}, P_{f,i}),\ n^{-1} \sum_{i=1}^{n} V(P_{f_0,i}, P_{f,i})\Bigr\} \lesssim \|f_0 - f\|_{2,n}^2 / \sigma^2. \qquad (7.3)$$
Define a sieve by VVj^^Mn — 2^/ logn}. Using the fact that $\|f_0 - f\|_{2,n} \le \|f_0 - f\|_\infty$, there exists $\theta_0 \in \mathbb{R}^J$, $\|\theta_0\|_\infty < \infty$ such that



n] 



i=l 



1=1 



Under Condition (D2), for all $\theta_1, \theta_2 \in \mathbb{R}^J$,

$$\|f_{\theta_1} - f_{\theta_2}\|_{2,n} \lesssim \|\theta_1 - \theta_2\|_2 / \sqrt{J}.$$



(7.4) 



(7.5) 



Hence we can apply Theorem 2 with $J_n = C_1 n^{1/(2\alpha+1)}$, $M_n = M_0 > \|f_0\|_\infty$ as a constant, $\bar\epsilon_n = n^{-\alpha/(2\alpha+1)}$ and $\epsilon_n = n^{-\alpha/(2\alpha+1)} \log n$. The conclusion then follows by Theorem 4 of Ghosal and van der Vaart [2007b]. The second part of the proof proceeds in a similar way as in Section 5.2. □

The above arguments apply to random covariates, too. Assume $Z_1, \dots, Z_n$ have a marginal density $g$; then in the posterior calculation, we can absorb $g$ in the dominating measure and denote it as $G$. Then Theorem 6 holds for random covariates while $\|\cdot\|_{2,n}$ is replaced by $\|\cdot\|_{2,n,G}$ and $\|\cdot\|_{2,n}^*$ is replaced by $\|\cdot\|_{2,n,G}^*$.

Remark 8. From (7.3), to get the adaptation rate, we only need $L_2$-approximation for the true regression function. This implies that other basis expansions that have an $L_2$-approximation property can also be used for this model.



7.2 Binary regression 

Bayesian methods have been used to study binary regression models for a while. The prior is commonly chosen as induced from a Gaussian process. A posterior consistency result was obtained by Ghosal and Roy [2006] while rates results were given by van der Vaart and van Zanten [2008]. Useful computational techniques were developed by Albert and Chib [1993] and Rasmussen and Williams [2006].

Assume that we have $n$ independent observations $(Z_1, X_1), \dots, (Z_n, X_n)$ from a binary regression model $P(X = 1 \mid Z = z) = 1 - P(X = 0 \mid Z = z) = f_0(z)$, where the $X$'s take values in $\{0, 1\}$ and the $Z$'s are either fixed or random covariates in some domain $\mathcal{Z}$. Given a link function $\Psi : \mathbb{R} \to (0, 1)$, we can construct a prior on the regression function $f_0$ using spline functions as $f_\theta(z) = \Psi\{\theta^T B(z)\}$. The likelihood function for $(Z, X)$ can be written as

$$L_\theta(z, x) = f_\theta(z)^x \{1 - f_\theta(z)\}^{1-x} g(z), \qquad (7.6)$$
where $g$ is the marginal density of $Z$. When computing the posterior distribution, $g$ cancels out. Therefore, in the following study, we can absorb $g$ into the dominating measure, denoted $G$, and remove $g$ from the model (7.6). Define $p_\theta = \Psi(\theta^T B)^x \{1 - \Psi(\theta^T B)\}^{1-x}$ and $p_{f_0} = f_0^x (1 - f_0)^{1-x}$. The following lemma states that if the link function satisfies certain conditions, then the KL divergence can be controlled by the Euclidean distance ($L_2$-distance and supremum distance). Its proof is similar to that of Lemma 3.2 of van der Vaart and van Zanten [2008], and hence is omitted.



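Lemma 2. If $\Psi$ possesses a bounded derivative $\psi$ and the function $\psi/(\Psi(1 - \Psi))$ is bounded, then for any $\theta_1, \theta_2 \in \mathbb{R}^J$, $J \ge 1$ and $r \ge 1$, we have the following:

$$\|p_{\theta_1} - p_{\theta_2}\|_r \lesssim \|\theta_1 - \theta_2\|_{r,G}, \qquad (7.7)$$

$$\max\{K(p_{\theta_1}, p_{f_0}), V(p_{\theta_1}, p_{f_0})\} \lesssim \|\Psi(\theta_1^T B) - f_0\|_{2,G}^2. \qquad (7.8)$$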

In practice, some choices of link function $\Psi$ can help us construct a prior for $\theta$ in an easy way. For example, if we choose $\Psi$ as the identity link, then the prior of $\theta$ can be defined on $(0, 1)^J$. If we choose $\Psi(z) = z/(1 + z)$, then the prior of $\theta$ can be defined on all positive numbers.
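The following sketch shows how the likelihood is evaluated for a given coefficient vector under the link $\Psi(z) = z/(1 + z)$, with the marginal density of $Z$ dropped as discussed above. The helper names are hypothetical, and the histogram basis is a stand-in chosen only to keep the example short; a higher-order B-spline basis would be passed in the same way.

```python
import math


def link(u):
    """Psi(u) = u / (1 + u), mapping positive numbers into (0, 1)."""
    return u / (1.0 + u)


def log_likelihood(theta, basis, data, psi=link):
    """Sum over (z, x) of x*log f(z) + (1-x)*log(1-f(z)) with f(z) = psi(theta' B(z))."""
    total = 0.0
    for z, x in data:
        f = psi(sum(th * b for th, b in zip(theta, basis(z))))
        total += x * math.log(f) + (1 - x) * math.log(1.0 - f)
    return total


def histogram_basis(z, J=2):
    """Hypothetical stand-in basis: indicators of J equal bins on [0, 1] (order-1 B-splines)."""
    k = min(int(z * J), J - 1)
    return [1.0 if i == k else 0.0 for i in range(J)]
```

Because this link maps positive inputs into $(0, 1)$, any prior supported on positive coefficients yields a valid success probability when the basis functions are nonnegative.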

Theorem 7. Suppose the function $f_0 \in C^\alpha(0, 1)$, $\alpha \le q$ and the link function $\Psi$ has a bounded derivative $\psi$ while $\psi/(\Psi(1 - \Psi))$ is bounded. Then the posterior distribution relative to the constructed prior converges at rate $\epsilon_n = n^{-\alpha/(2\alpha+1)}(\log n)^{1/2}$ with respect to the $\|\cdot\|_{2,G}$ distance.

Proof. Since $\psi/(\Psi(1 - \Psi))$ is bounded, $p_\theta$ is uniformly bounded below and above. Hence it is equivalent to work with $\|\cdot\|_{2,G}$ and the Hellinger distance. Using Lemma 2 for $r = 2$ and Theorem 2 for $J_n \asymp n^{1/(2\alpha+1)}$, $M_n > \max\{\Psi^{-1}(f_0), f_0\}$, $\bar\epsilon_n = n^{-\alpha/(2\alpha+1)}$ and $\epsilon_n = n^{-\alpha/(2\alpha+1)}(\log n)^{1/2}$, we can verify the conditions in Theorem 4 of Ghosal and van der Vaart [2007b] in a similar way to the proof for density estimation. □

Remark 9. The above arguments imply that $L_2$-approximation is good enough to achieve adaptation. Hence we can also use other basis expansions that have an $L_2$-approximation property.

If $f_0$ is bounded away from 0 and 1, using Lemma [H we can construct priors of $\theta$ on $(0, 1)^J$. This helps simplify the computation in a similar manner to Section 5.3. Assume the identity link function and use Beta priors $\theta_i \sim \mathrm{Beta}(a_i, b_i)$ for some positive numbers $a_i$ and $b_i$; the posterior mean of $f_0(z)$ is

j=l i=l ^ ^' ^ ^' io = l j„ = l"^° k=l s=0 



where $Z_0 = z$. Define $I_{k,j,0} = \sum_{s=0}^{n} 1\{i_s = k\}$ and $I_{k,j,1} = \sum_{s=1}^{n} 1\{i_s = k\}$. The above equation can be simplified into



t=0,...,j t=0,...,j 



r(ai + bi) ^ TT D ^ V \ "TT ^("'^ + &fc + 4,j,i) 



t=i,...j- t=i,...,j 

it=k it=k 



7.3 Poisson regression 



Consider a Poisson regression model $X_i \sim \mathrm{Poi}(f(Z_i))$, where $f$ is an unknown monotonic function and the $Z_i$'s are covariates. For convenience, we assume the $Z_i$'s are fixed here; the random covariates case can be treated similarly as in Sections 7.1 and 7.2. Using a random series expansion, $f$ can be modeled either as $f(z) = \theta^T B(z)$, where the prior of $\theta$ is restricted to $(0, \infty)^J$, or through a link function $f(z) = g(\theta^T B(z))$, e.g., choose $g$ as the exponential link function and allow $\theta$ to be defined on $\mathbb{R}^J$.

The adaptation results can be obtained in a similar way to Section 7.1. Define an empirical measure $P_n^Z = n^{-1} \sum_{i=1}^{n} \delta_{Z_i}$, the covariance matrix $\Sigma_{n,J}$, whose $(k, l)$-th element equals $\int B_k B_l \, dP_n^Z$, and $\|\cdot\|_{2,n}$ as the norm on $L_2(P_n^Z)$.

By applying the arguments in Section 7.1.1 of Ghosal and van der Vaart [2007b], we only need approximation results for the $L_2$-distance. The posterior rates theorem is stated as follows.

Theorem 8. Suppose the true function $f_0 \in C^\alpha(\mathbb{R})$ for some $\alpha \le q$ and satisfies $L \le f_0 \le U$ for some constants $L, U > 0$. The priors are constructed as in Section 3.2. Then the posterior converges at the rate $\epsilon_n = n^{-\alpha/(2\alpha+1)}(\log n)^{1/2}$ relative to $\|\cdot\|_{2,n}$.

Remark 10. Since only $L_2$-approximation is needed, we can use other basis functions that have an $L_2$-approximation property as the expansion technique.

If we choose the identity link and let $\theta_i \sim \mathrm{Gamma}(a_i, b_i)$ for some positive numbers $a_i$ and $b_i$, then the posterior mean of $f(z)$ is



j=i "'u f,^^ \ fcy 



] n 



n J 



i=l k=l 



k=l i=l 



n ] 



eM-Y,di{Y,B^{Zk) + h)}\{{Y,0iBi{Zu)}'''dO 



i=l k=l 



k=l i=l 



Denote $Z_0 = z$, $X_0 = 1$. Define $b = (b_1, \dots, b_J)^T$, $s(k)$ as the $k$-th element of vector $s$ and $B(z)^s = \prod_{i=1}^{J} B_i(z)^{s(i)}$ for some index vector $s$; the above quantity simplifies into



5.- 3 



T{ak + s{k))b^ 



SQ,...,BneTi-'^ 

I sq I —^0 ' ■ ■ ■ ' I \—^n 
S = So-\ hSn 



i=0 * fc= 



-.(7.9) 



Eno-) E n 



Sl,...,STlfci^O 

|sit— ^l>--->lsn|— 
S = SiH hSri 



i=l 



n 



r(<n + s(k))v 



\nak){b, + Y.toBk{z,)Y''^^i^^ 



7.4 Functional regression model 



Spline functions are widely used to model functional data; see Cardot et al. [2003] for example. An asymptotic rates result of convergence was obtained by Hall and Horowitz [2007]. A Bayesian method based on splines is given by Goldsmith et al. [2011]. However, to the best of our knowledge, no results on posterior convergence rates for this model are yet available. We consider two types of functional regression model. The first one assumes that only the covariates $Z(t)$ and the effects depend on time $t$. The second one allows functional observations $X(t)$.
Suppose that the data that we observe are independent and identically distributed pairs $(Z_1, X_1), \dots, (Z_n, X_n)$, where each $Z$ is a square integrable random function defined on the unit interval and the $X$'s are scalars. Assume the residuals (i.i.d.) follow a normal distribution with mean 0 and unknown variance $\sigma^2$. A functional regression model can be formulated as follows:

$$X_i = \int_0^1 Z_i(t) \beta(t)\, dt + \epsilon_i, \qquad (7.10)$$

where $\beta(t)$ is the coefficient function we want to estimate.
The following assumptions are needed:

(E1) Smoothness: The function $\beta(t)$ is assumed to be $\alpha$-Hölder.

(E2) Invertible condition: Assume $E Z^2(t)$ is continuous and positive for every $t \in [0, 1]$.

Theorem 9. Suppose that the true regression function $\beta$ satisfies Conditions (E1) and (E2) and the prior is constructed as in Section 3.2 and (7.1). Then the posterior converges at the rate $\epsilon_n = n^{-\alpha/(2\alpha+1)} \log n$ relative to $\|\cdot\|_{2,Z}$, which is defined as $\|f\|_{2,Z}^2 = \int_0^1 f^2(t) E\{Z^2(t)\}\, dt$.

Proof. Consider a B-spline expansion $\beta(t) = \sum_{k=1}^{J} \theta_k B_k(t)$. Denote $W_{ik} = \int_0^1 Z_i(t) B_k(t)\, dt$; then the model can be written as

$$X_i = \sum_{k=1}^{J} \theta_k W_{ik} + \epsilon_i. \qquad (7.11)$$

Applying the same arguments as in Section 7.1, we can work with $\|\cdot\|_{2,Z}$ instead of the Hellinger distance. Given $Z$, define $P_{\beta,i}$ as the normal measure with mean $\int_0^1 Z_i(t) \beta(t)\, dt$ and variance $\sigma^2$. This allows us to bound the KL divergences using the Cauchy-Schwarz inequality:

$$\max\{K(P_{\beta_0}, P_\beta), V(P_{\beta_0}, P_\beta)\} \lesssim \frac{1}{\sigma^2} E_Z \Bigl(\int_0^1 Z(t) \{\beta(t) - \beta_0(t)\}\, dt\Bigr)^2 \le \frac{1}{\sigma^2} \|\beta - \beta_0\|_2^2\, E \int_0^1 Z^2(t)\, dt \lesssim \|\beta - \beta_0\|_2^2. \qquad (7.12)$$

Hence we can apply Theorem 2 with $J_n = C_1 n^{1/(2\alpha+1)}$, $M_n = M_0 > \|\beta_0\|_\infty$ as a constant, $\bar\epsilon_n = n^{-\alpha/(2\alpha+1)}$ and $\epsilon_n = n^{-\alpha/(2\alpha+1)} \log n$. Therefore the conditions in Theorem 4 of Ghosal and van der Vaart [2007b] are verified. The proof is complete. □
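The reduction of the functional model to a finite-dimensional linear model in (7.11) can be sketched numerically as follows. This is only an illustration under simplifying assumptions: the basis is taken to be indicator functions of $J$ equal bins (order-1 B-splines) rather than a higher-order spline, the integrals are approximated by the midpoint rule, and the function names are hypothetical.

```python
def design_matrix(curves, J, grid=1000):
    """W[i][k] ~ integral of Z_i(t) B_k(t) dt, by the midpoint rule on [0, 1].

    Hypothetical simplification: B_k is the indicator of the k-th of J equal bins
    (an order-1 B-spline), not a higher-order spline.
    """
    h = 1.0 / grid
    W = [[0.0] * J for _ in curves]
    for m in range(grid):
        t = (m + 0.5) * h
        k = min(int(t * J), J - 1)          # bin containing t
        for i, z in enumerate(curves):
            W[i][k] += z(t) * h
    return W


def linear_predictor(W_row, theta):
    """sum_k theta_k W_ik, the mean of X_i in the finite-dimensional form of the model."""
    return sum(th * w for th, w in zip(theta, W_row))
```

Once the matrix of the $W_{ik}$ is formed, the problem has the same structure as the regression model of Section 7.1 with the $W_{ik}$ playing the role of basis evaluations.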

Next, we consider a longitudinal type of functional model:

$$X_i(T_i) = Z_i(T_i) \beta(T_i) + \epsilon(T_i). \qquad (7.13)$$

For each object $i$, we observe its response at a random time $T_i$ with some random covariate $Z_i$. We assume $Z_i \sim Z$, $T_i \sim T$ and $\epsilon_i \sim N(0, \sigma^2)$ for some finite probability distributions $Z$, $T$ defined on the unit interval and some unknown variance $\sigma^2$. In order to obtain an argument similar to (7.12), we need an additional condition:

(E3) We assume $T$ has a positive density function $g(t)$ on $[0, 1]$.
We have

m^x{K{Pp„Pp),V{Pp„Pp)} < f Z\t){(5{t)-(5,{t)fg{t)dt 

^ Jo 



< ^,E{Z\t)}\\P - (7.14) 



By assuming the prior for $\sigma$ satisfies (7.1) and using arguments similar to those in Theorem 9, we have the convergence theorem.

Theorem 10. Suppose the true regression function $\beta(t)$ satisfies Conditions (E1)-(E3) and the prior is constructed as in Section 3.2 and (7.1). Then the posterior converges at the rate $\epsilon_n = n^{-\alpha/(2\alpha+1)} \log n$ relative to $\|\cdot\|_{2,Z}$.

This rate coincides with the optimal rate obtained in Cai and Yuan [2011].



Remark 11. Note that the restriction of defining covariates on a compact interval can be dropped from Theorems 9 and 10, as we can project the support of covariates from $\mathbb{R}$ to the unit interval by a link function. Then the results proceed in a similar way to Section 5.2.

Remark 12. From (7.12) and (7.14), we observe that only $L_2$-approximation is needed. Hence other basis functions, such as a polynomial basis, can be used as the expansion technique.
8 Extension to other function spaces

So far, our discussion has been restricted to functions that belong to a Hölder class. However, in applications, there is a lot of interest in studying more general classes of functions, e.g., Sobolev and Besov spaces. We first give their definitions, then present the approximation results of B-splines in these function spaces.

Definition 1. A Sobolev space $L_p^\alpha[a, b]$ on a compact interval $[a, b]$ for $1 \le p \le \infty$ and a positive integer $\alpha$ is defined by:

$$L_p^\alpha[a, b] = \{f : f^{(j)} \in L_p[a, b],\ j = 1, \dots, \alpha\} \qquad (8.1)$$

with an associated norm $\|f\|_{L_p^\alpha[a,b]} = \sum_{j=0}^{\alpha} \|f^{(j)}\|_p$.

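Definition 2. A Besov space $B_{p,q}^\alpha[a, b]$ on $[a, b]$ for $1 \le p, q \le \infty$ and $\alpha > 0$ is the collection of all functions $f$ such that

$$|f|_{B_{p,q}^\alpha} = \Bigl(\int_0^\infty \{t^{-\alpha} \omega_r(f, t)_p\}^q \frac{dt}{t}\Bigr)^{1/q} < \infty, \qquad (8.2)$$

where $r > \alpha$ is an integer and $\omega_r(f, t)_p = \sup_{|h| \le t} \|\Delta_h^r(f, \cdot)\|_p$ is called the modulus of smoothness of order $r$.

If $q = \infty$, then $|f|_{B_{p,\infty}^\alpha}$ is defined as $\sup_{t > 0} t^{-\alpha} \omega_r(f, t)_p$ instead. Further define an associated norm on $B_{p,q}^\alpha[a, b]$ as

$$\|f\|_{B_{p,q}^\alpha} = \|f\|_p + |f|_{B_{p,q}^\alpha}. \qquad (8.3)$$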



The following two lemmas are taken from Chapter 6 of Schumaker [2007]. They describe the approximation abilities of B-splines in these function spaces, which directly determine the posterior rates of the corresponding Bayesian models.

Lemma 3. Suppose $1 \le p \le q \le \infty$. For any $0 \le r \le \alpha - 1$ and $f \in L_p^\alpha[a, b]$, there exists a constant $C$ and $\theta \in \mathbb{R}^J$ such that

In particular, if we choose $r = 0$ and $p = q$, the above inequality simplifies into

$$\|f - \theta^T B\|_p \le C J^{-\alpha}.$$

Lemma 4. For any function $f \in B_{p,q}^\alpha[a, b]$, $1 \le p \le p' \le \infty$ and $1 \le q, q' \le \infty$, $0 < r < \lfloor \alpha \rfloor$, there exists a constant $C$ and $\theta \in \mathbb{R}^J$ such that


If we can bound the KL divergence by an appropriate norm in the new function spaces, then the posterior rates calculation will be very similar to what we have done in the previous discussion. For example, taking $p = 2$ in Lemma 3, we can show that the posterior rate results in Theorem S] remain the same for $f_0 \in L_2^\alpha[a, b]$ and the use of the log link function. This is true because



h\hJe) <V{hJe) <\\\ogh - B 



125 



K{f,Je)<2h\f,Je) f < \\\og fo - 6^ B\\l (8.4) 

J0 



oo 



where $f_\theta = \exp\{\theta^T B\} / \int \exp\{\theta^T B\}$ is the approximating density function, which is lower bounded by a multiple of $\exp\{-c\|\theta\|_\infty\}$ for some positive constant $c$. Similar results apply to regression and spectral density models, too.

For Sobolev spaces, the restriction is that Lemma 3 only considers the case when the smoothness parameter $\alpha$ is an integer. It is not clear whether such an approximation result remains valid for non-integer values of $\alpha$. For Besov spaces, notice that there is an extra power in Lemma 4 where $r$ is strictly greater than 0. This suggests only a sub-optimal rate $\epsilon_n = n^{-\alpha/(2\alpha+1)+r}(\log n)^c$ ($r, c > 0$) can be achieved by our methods, though $r$ can be chosen arbitrarily close to 0. This is because we only need to bound the approximation error under Euclidean distances; extra approximation results on derivatives are not necessary.



9 Numerical results 



We illustrate the use of the conjugate prior structure as described in Section 5.3. Following Lenk [1991], we generate 50 samples from a mixture density of an exponential and a normal distribution



(9.1) 



We implement the random series prior using quadratic B-splines ($q = 3$) and choose a geometric prior Geo(.15) for $J$ restricted between 5 and 12. The lower truncation ensures a minimum number of terms in the series expansion while an upper truncation is necessary to carry out the actual computation using a computer. For $\theta$, we use a Dirichlet distribution with parameters $a_1 = \dots = a_J = 1$. Instead of iterating over all $3^{50}$ possible permutations of indices to get equation (5.6), we randomly sample 1000 of them and take the associated average values. We obtained density estimates on 1000 grid points in the unit interval. The maximum Monte-Carlo standard error of the estimates is 0.12, calculated by the Delta method. The computation takes about 15 minutes on a 2.20 GHz machine. In contrast, it takes 1 hour to run an RJMCMC method on the same problem. We compare our results with the use of a Gaussian process prior in Tokdar [2007]. Our method has a mean squared error (MSE) 0.076 to
the true density while the MSE for the Gaussian process prior is 0.111. Figure 1 shows that the random series prior has performance comparable to that of the Gaussian process prior.
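As a small illustration of the prior on $J$ used in this experiment, the sketch below tabulates a geometric prior truncated to $\{5, \dots, 12\}$. It assumes Geo(.15) denotes a geometric distribution with success probability 0.15; after truncation and renormalization the two usual parametrizations (support starting at 0 or at 1) give the same weights, proportional to $0.85^j$. The function name is hypothetical.

```python
def truncated_geometric(p, low, high):
    """Prior on J proportional to (1 - p)^j for j = low, ..., high, normalized to sum to one."""
    weights = {j: (1.0 - p) ** j for j in range(low, high + 1)}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


prior = truncated_geometric(0.15, 5, 12)
```

Under this prior the smallest dimension $J = 5$ receives the largest weight, roughly 0.21, and each additional basis function is down-weighted by a factor of 0.85.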